Why Your LLM "Hallucinates" on PDF Images (And How to Fix It)
We use PDFs every day: they are the universal language of research papers and business reports. While current Large Language Models (LLMs) are incredibly powerful at understanding both text and images, they have a "blind spot" when it comes to PDFs that contain critical visual data.
If you’ve ever noticed your model giving an incorrect answer about a chart in a research paper, the problem isn't necessarily the model's intelligence—it's the processing pipeline.
The Problem: Text Tokens vs. Visual Tokens
When you upload a PDF to tools like NotebookLM, Gemini, or Google AI Studio, the system decodes the file. However, if a figure isn't stored as an Image XObject (the object type the PDF format uses for embedded raster images), for example because a chart is drawn with vector graphics commands, the model often fails to "see" it as a picture. Instead, it ends up interpreting the figure as a messy string of text tokens.
Figure 1: How an LLM processes an image embedded in a PDF
In the example above, even state-of-the-art (SOTA) multimodal models can fail. Because the image is never converted into visual tokens, the model loses the spatial context and the data relationships within the graphic, which leads to errors in understanding.
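One rough way to check whether a given PDF is likely to hit this problem is to count the Image XObjects it contains. The sketch below is a crude heuristic, not a real PDF parser: it scans the raw bytes for the `/Subtype /Image` dictionary entry. The function names are hypothetical. It can miss inline images embedded directly in page content, and it does not say which page or figure an image belongs to. A count of zero on a document full of charts suggests the figures are vector drawings rather than embedded pictures.

```python
import re
from pathlib import Path

# Matches the dictionary entry that marks a stream as an Image XObject.
# The trailing \b keeps names such as /ImageMask from matching.
IMAGE_XOBJECT = re.compile(rb"/Subtype\s*/Image\b")


def count_image_xobjects(pdf_bytes: bytes) -> int:
    """Count `/Subtype /Image` entries in the raw bytes of a PDF (heuristic)."""
    return len(IMAGE_XOBJECT.findall(pdf_bytes))


def likely_has_raster_images(pdf_path: str) -> bool:
    """Return True if the file appears to contain at least one Image XObject."""
    return count_image_xobjects(Path(pdf_path).read_bytes()) > 0
```

For example, `likely_has_raster_images("paper.pdf")` (with a hypothetical file name) returning `False` is a hint that the image-first workflow described below is worth trying for that document.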
The Solution: The "Image-First" Workflow
My current solution is simple but effective: Convert the PDF pages into images (screenshots) before sending them to the LLM.
By feeding the model an image file (like a PNG or JPEG), you force it to engage its vision capabilities. It no longer tries to "read" the image as code or text; it "looks" at it as a visual context, which significantly improves accuracy for charts, diagrams, and complex layouts.
Automating the Process
To make this convenient, I used Claude Code to build a local LLM workflow. The resulting script handles the "grunt work" of converting my PDF pages into high-quality images, so I can jump straight into analysis.
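As an illustration of the conversion step (not the exact script described above), here is a minimal sketch that renders each page to a PNG. It assumes the PyMuPDF library is installed (`pip install pymupdf`). The function names, the 150 DPI default, and the output naming scheme are all choices made for this example.

```python
from pathlib import Path

PDF_UNITS_PER_INCH = 72  # default PDF user space is 72 units per inch


def zoom_for_dpi(dpi: int) -> float:
    """Scale factor that renders a page at the requested DPI."""
    return dpi / PDF_UNITS_PER_INCH


def page_image_path(pdf_path: str, page_index: int, out_dir: str) -> Path:
    """Build an output name such as pages/report_p001.png (1-based page number)."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_p{page_index + 1:03d}.png"


def pdf_to_pngs(pdf_path: str, out_dir: str = "pages", dpi: int = 150) -> list[Path]:
    """Render every page of the PDF to a PNG file and return the written paths."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    zoom = zoom_for_dpi(dpi)
    written = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            # Rasterize the whole page, including vector charts and text.
            pixmap = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            target = page_image_path(pdf_path, index, out_dir)
            pixmap.save(str(target))
            written.append(target)
    return written
```

Calling `pdf_to_pngs("paper.pdf")` (hypothetical file name) would write one PNG per page into a `pages` folder. A higher DPI keeps small chart labels legible but produces larger images, which matters for the token cost discussed next.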
The Trade-Off: The Token Cost
While this method is much more accurate, there is one significant "con" to keep in mind: Token Usage.
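When you process a document as an image, the model consumes far more tokens than it would for a standard text-based PDF. This is because an image is typically tokenized according to its pixel dimensions, no matter how little text the page holds, while extracted text costs only as many tokens as the words themselves.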
| File Type | Data Representation | Relative Token Cost |
| --- | --- | --- |
| PDF (Text) | Text Tokens | Low |
| Image (PNG) | Visual Tokens | High |
Figure 2: Token count comparison (Left: Image, Right: PDF)
As seen in Figure 2, the same content can take up vastly different amounts of your context window depending on the format.
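To get a feel for the gap before uploading anything, a back-of-the-envelope estimate can help. The sketch below is only a rough model, and both defaults are assumptions for illustration rather than measurements from this post: about 4 characters per text token is a common rule of thumb for English, and about 750 pixels per image token follows the approximation published for Claude models. Gemini and other providers count image tokens differently and may downscale large images, so check your provider's documentation for real numbers.

```python
import math


def estimate_text_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough text token estimate; chars_per_token is an assumed rule of thumb."""
    return math.ceil(len(text) / chars_per_token)


def estimate_image_tokens(
    width_px: int, height_px: int, pixels_per_token: float = 750.0
) -> int:
    """Rough image token estimate; pixels_per_token is an assumed, provider-specific ratio."""
    return math.ceil(width_px * height_px / pixels_per_token)


if __name__ == "__main__":
    # A US Letter page (8.5 x 11 in) rendered at 150 DPI is 1275 x 1650 pixels.
    page_as_image = estimate_image_tokens(1275, 1650)
    # The same page as extracted text, assuming about 3,000 characters on it.
    page_as_text = estimate_text_tokens("x" * 3000)
    print(f"image: ~{page_as_image} tokens, text: ~{page_as_text} tokens")
```

Even under these loose assumptions, the image version of a page costs several times more than its text, which matches the direction of the comparison in Figure 2.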
Conclusion
If your PDF is 100% text, stick to the standard upload. But if you’re working with data-heavy research or technical diagrams, converting to images is the "pro move" to ensure your LLM actually sees what you're seeing.
Have you found a way to maintain visual accuracy without the high token overhead? I’d love to hear your thoughts or better methods in the comments!

